The khmer software package: enabling efficient nucleotide sequence analysis

Authors

  • Michael R. Crusoe
  • Hussien F. Alameldin
  • Sherine Awad
  • Elmar Boucher
  • Adam Caldwell
  • Reed Cartwright
  • Amanda Charbonneau
  • Bede Constantinides
  • Greg Edvenson
  • Scott Fay
  • Jacob Fenton
  • Thomas Fenzl
  • Jordan Fish
  • Leonor Garcia-Gutierrez
  • Phillip Garland
  • Jonathan Gluck
  • Iván González
  • Sarah Guermond
  • Jiarong Guo
  • Aditi Gupta
  • Joshua R. Herr
  • Adina Howe
  • Alex Hyer
  • Andreas Härpfer
  • Luiz Irber
  • Rhys Kidd
  • David Lin
  • Justin Lippi
  • Tamer Mansour
  • Pamela McA'Nulty
  • Eric McDonald
  • Jessica Mizzi
  • Kevin D. Murray
  • Joshua R. Nahum
  • Kaben Nanlohy
  • Alexander Johan Nederbragt
  • Humberto Ortiz-Zuazaga
  • Jeramia Ory
  • Jason Pell
  • Charles Pepe-Ranney
  • Zachary N. Russ
  • Erich Schwarz
  • Camille Scott
  • Josiah Seaman
  • Scott Sievert
  • Jared Simpson
  • Connor T. Skennerton
  • James Spencer
  • Ramakrishnan Srinivasan
  • Daniel Standage
  • James A. Stapleton
  • Susan R. Steinman
  • Joe Stein
  • Benjamin Taylor
  • Will Trimble
  • Heather L. Wiencko
  • Michael Wright
  • Brian Wyss
  • Qingpeng Zhang
  • en zyme
  • C. Titus Brown
  • Rob Patro
  • Daniel Katz
  • Daniel S. Katz
  • Ewan Birney
Abstract

The khmer package is a freely available software library for working efficiently with fixed-length DNA words, or k-mers. khmer provides implementations of a probabilistic k-mer counting data structure, a compressible De Bruijn graph representation, De Bruijn graph partitioning, and digital normalization. khmer is implemented in C++ and Python, and is freely available under the BSD license at https://github.com/dib-lab/khmer/.
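
To make two of the ideas named in the abstract concrete, here is a minimal sketch of probabilistic k-mer counting and of digital normalization built on top of it. It is an illustration only and does not use or reproduce khmer's API: the class `CountMinSketch`, the functions `canonical` and `digital_normalization`, and the parameters `width`, `depth`, and `cutoff` are all hypothetical names chosen for this example. It assumes a Count-Min Sketch as the probabilistic counter and a median-abundance rule for normalization (a read is kept only while the median count of its k-mers is below `cutoff`).

```python
import hashlib
from statistics import median

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def canonical(kmer):
    """Return the smaller of a k-mer and its reverse complement."""
    reverse_complement = kmer.translate(_COMPLEMENT)[::-1]
    return min(kmer, reverse_complement)


class CountMinSketch:
    """Approximate counter: may overestimate a count, never underestimates."""

    def __init__(self, width, depth):
        self.width = width
        self.depth = depth
        self.tables = [[0] * width for _ in range(depth)]

    def _cells(self, kmer):
        # One salted hash per row gives `depth` independent table positions.
        for row in range(self.depth):
            digest = hashlib.blake2b(
                kmer.encode(), digest_size=8, salt=row.to_bytes(8, "little")
            ).digest()
            yield row, int.from_bytes(digest, "little") % self.width

    def add(self, kmer):
        for row, col in self._cells(canonical(kmer)):
            self.tables[row][col] += 1

    def count(self, kmer):
        # The minimum over rows is the estimate least inflated by collisions.
        return min(self.tables[row][col] for row, col in self._cells(canonical(kmer)))


def kmers_of(sequence, k):
    """Return every overlapping k-mer of a sequence, in order."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def digital_normalization(reads, sketch, k, cutoff):
    """Keep a read only if the median count of its k-mers is below `cutoff`."""
    kept = []
    for read in reads:
        kmers = kmers_of(read, k)
        if not kmers:
            continue  # read shorter than k
        if median(sketch.count(kmer) for kmer in kmers) < cutoff:
            for kmer in kmers:
                sketch.add(kmer)
            kept.append(read)
    return kept
```

The memory footprint is fixed by `width * depth` regardless of how many distinct k-mers are seen, which is the trade-off that makes this kind of counter attractive for large read sets: counts become approximate, but memory stays bounded.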


Similar resources

These Are Not the K-mers You Are Looking For: Efficient Online K-mer Counting Using a Probabilistic Data Structure

K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix array...


Walking the Talk: Adopting and Adapting Sustainable Scientific Software Development processes in a Small Biology Lab

The khmer software project provides both research and production functionality for large-scale nucleic-acid sequence analysis. The software implements several novel data structures and algorithms that perform data pre-filtering for common bioinformatics tasks, including sequence mapping and de novo assembly. Development is driven by a small lab with one full-time developer (MRC), as well as sever...


Efficient cardinality estimation for k-mers in large DNA sequencing data sets

We present an open implementation of the HyperLogLog cardinality estimation sketch for counting fixed-length substrings of DNA strings (“k-mers”). The HyperLogLog sketch implementation is in C++ with a Python interface, and is distributed as part of the khmer software package. khmer is freely available from https://github.com/dib-lab/khmer under a BSD License. The features presented here are in...
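
For readers unfamiliar with the technique, the following is a minimal sketch of a HyperLogLog cardinality estimator applied to strings such as k-mers. It is not the khmer implementation (which the abstract describes as C++ with a Python interface); the class name `HyperLogLog`, the precision parameter `p`, and the choice of BLAKE2b as the hash function are assumptions made for this example.

```python
import hashlib
import math


class HyperLogLog:
    """Estimate the number of distinct items using 2**p small registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant; this closed form holds for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        digest = hashlib.blake2b(item.encode(), digest_size=8).digest()
        h = int.from_bytes(digest, "big")
        # The top p bits select a register; the rest feed the rank.
        index = h >> (64 - self.p)
        rest = h & ((1 << (64 - self.p)) - 1)
        # Rank = position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[index] = max(self.registers[index], rank)

    def estimate(self):
        harmonic = sum(2.0 ** -r for r in self.registers)
        raw = self.alpha * self.m * self.m / harmonic
        zeros = self.registers.count(0)
        # Small-range correction: fall back to linear counting.
        if raw <= 2.5 * self.m and zeros:
            return self.m * math.log(self.m / zeros)
        return raw


def estimate_distinct_kmers(sequence, k, p=12):
    """Estimate how many distinct k-mers occur in a sequence."""
    sketch = HyperLogLog(p)
    for i in range(len(sequence) - k + 1):
        sketch.add(sequence[i:i + k])
    return sketch.estimate()
```

Adding the same item twice leaves the registers unchanged, so repeated k-mers do not inflate the estimate; the expected relative error is roughly 1.04 / sqrt(2**p).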


Crossing the streams: a framework for streaming analysis of short DNA sequencing reads

We present a semi-streaming algorithm for k-mer spectral analysis of DNA sequencing reads, together with a derivative approach that is fully streaming. The approach can also be applied to genomic, transcriptomic, and metagenomic data sets. We develop two tools for short-read analysis based on these approaches, a method for semi-streaming k-mer-based error trimming, and a method for...


FFBSKAT: Fast Family-Based Sequence Kernel Association Test

The kernel machine-based regression is an efficient approach to region-based association analysis aimed at identification of rare genetic variants. However, this method is computationally complex. The running time of kernel-based association analysis becomes especially long for samples with genetic (sub) structures, thus increasing the need to develop new and effective methods, algorithms, and ...



Journal title:

Volume 4, Issue

Pages -

Publication date: 2015